Implement multi-site local sensitivity analysis workflow #1
divine7022 wants to merge 25 commits into ccmmf:main from
Conversation
> BTW this looks really nice, well done. Though I'd like to work through the steps as I review. Could you please add a readme with instructions on how to use the repository?

> thanks for reviewing, updated readme in #2
infotroph
left a comment
Just realized I left these comments unposted on my partial readthrough the other day, so adding now. Would it be useful for me to finish a full review now or wait for updates first?
scripts/012_aggregate_sensitivity.R
Outdated
```r
#----------------------------------
# read.settings() uses xmlToList, which loses duplicate <variable> tags,
# so read the XML directly
library(XML)
xml_doc <- XML::xmlParse(file.path("output", "pecan.CONFIGS.xml"))
sa_variables <- unique(XML::xpathSApply(
  xml_doc,
  "//sensitivity.analysis//variable",
  XML::xmlValue
))
XML::free(xml_doc)
```
Suggested change:

```diff
-#----------------------------------
-# read.settings() uses xmlToList, which loses duplicate <variable> tags, so read XML directly
-library(XML)
-xml_doc <- XML::xmlParse(file.path("output", "pecan.CONFIGS.xml"))
-sa_variables <- unique(XML::xpathSApply(
-  xml_doc,
-  "//sensitivity.analysis//variable",
-  XML::xmlValue
-))
-XML::free(xml_doc)
+sa_variables <- settings$sensitivity.analysis |>
+  (\(x) x[names(x) == "variable"])() |>
+  unlist() |>
+  unique()
```
```diff
@@ -0,0 +1,108 @@
+#!/usr/bin/env Rscript
```
Does this site selection process differ from what David already implemented in the downscaling repo? Seems preferable to use an existing tool rather than add that complexity to the UA process.
I have refactored this script to completely remove the clustering logic. It now acts purely as a consumer of the shared design_points.csv artifact (currently David's manually selected 198 sites) to perform the necessary PFT mapping and sub-sampling for UA.
We will treat this as the interface for now, and can revisit or refactor the integration logic once the broader coordination strategy between the Downscaling and Uncertainty workflows is finalized.
see the comment thread in #1
R/local_sensitivity.R
Outdated
```diff
@@ -0,0 +1,497 @@
+#' Aggregate local sensitivity results across sites
```
I'm not sure what "local" means here
Good catch regarding the ambiguity.
In this context, 'local' refers to the mathematical method used by PEcAn's variance-decomposition one-at-a-time (OAT) SA, where parameters are varied individually around their median to calculate partial derivatives (elasticities). This is distinct from the global sensitivity analysis we perform later.
I have updated the function documentation to explicitly state this aggregates one-at-a-time (OAT) parameter sensitivity results to avoid confusion with geographical locality.
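For intuition, here is a toy illustration of the OAT elasticity idea described above (not PEcAn's actual implementation; the model and values are made up): perturb one parameter around its median, estimate the local slope, and normalize it into a unitless elasticity.

```r
# Toy OAT elasticity sketch: for y = 3*p^2, the analytic elasticity
# (dy/dp) * p / y equals 2 everywhere, so the estimate should recover 2.
toy_model <- function(p) 3 * p^2          # stand-in for a model output
p_med <- 2                                 # parameter median
p_vals <- p_med * c(0.9, 1, 1.1)           # one-at-a-time perturbations
y_vals <- toy_model(p_vals)
slope <- coef(lm(y_vals ~ p_vals))[["p_vals"]]   # local dy/dp
elasticity <- slope * p_med / toy_model(p_med)   # (dy/dp) * (p / y)
```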
R/local_sensitivity.R
Outdated
```r
#' Aggregate local sensitivity results across sites
#'
#' @param sensitivity_outdir Directory containing PEcAn sensitivity outputs
#' @param design_points Data frame of site metadata
```
I recommend documenting what columns are expected in this df, and whether unexpected columns are ignored or cause errors
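One way to do both at once is a documented contract plus a small validator. The column names below are assumptions for illustration (chosen to match the design_points.csv usage elsewhere in this PR), not the function's actual schema:

```r
# Hypothetical column contract for design_points: required columns are
# checked, and unexpected extra columns are silently dropped rather than
# causing errors.
#' @param design_points Data frame of site metadata. Must contain columns
#'   `site_id`, `lat`, `lon`, and `pft`; additional columns are ignored.
validate_design_points <- function(design_points) {
  required <- c("site_id", "lat", "lon", "pft")
  missing <- setdiff(required, names(design_points))
  if (length(missing) > 0) {
    stop("design_points is missing columns: ", paste(missing, collapse = ", "))
  }
  design_points[required]  # extra columns are dropped, not an error
}

dp <- data.frame(site_id = 1, lat = 38, lon = -121, pft = "crop", extra = "x")
checked <- validate_design_points(dp)
```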
R/local_sensitivity.R
Outdated
```r
)

# Process all files
all_results <- purrr::map_dfr(sa_files, function(sa_file) {
```
This will be much easier to read and understand if you define it as a function with an informative name -- yes, even if it's only called once as `all_results <- purrr::map_dfr(sa_files, name_of_function)`
I have extracted the file processing logic into a standalone helper function `process_sa_file`.
Thanks for the comments,
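A minimal sketch of that refactor (the helper body here is illustrative, not the real `process_sa_file`, which reads PEcAn sensitivity output files):

```r
# Named helper passed to map_dfr: the name documents intent even when the
# function is called only once.
library(purrr)

process_sa_file <- function(sa_file) {
  # In the real script this loads a PEcAn sensitivity output file;
  # here we just return a one-row summary for the sketch.
  data.frame(file = sa_file, n = nchar(sa_file))
}

sa_files <- c("site_a.Rdata", "site_b.Rdata")
all_results <- purrr::map_dfr(sa_files, process_sa_file)  # one row per file
```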
…rely as a consumer of the shared design_points.csv artifact
…t list extraction
```r
# Currently, this script consumes the 'design_points.csv' from the shared
# directory. At this stage of the project, these are MANUALLY SELECTED
# points (198 sites), not yet the output of the automated clustering
# workflow.
#
# TODO: Once the integration architecture between the Downscaling and
# Uncertainty repos is finalized, refactor this script to continue consuming
# the artifact generated by that pipeline.
# =======================================================================
```
dlebauer
left a comment
Overall, this is very nice work. The code is clear and easy to understand.
I've read through the scripts and R functions. My next step will be to review the report. I've made a few comments. Not all need to be done now, but please write tickets or todos to capture future work.
```yaml
pecan_xml_template: "data_raw/template.xml"
sites:
  design_points_file: "data_raw/sa_design_points.csv"
sensitivity:
```
The goal with this file was to handle settings that aren't in the pecan xml. Is there a reason that these are included here rather than the template.xml?
Minimizing config options here, and providing sensible defaults in the template.xml could make it more clear to end users.
```diff
@@ -8,8 +8,19 @@ default:
   raw_data_dir: "data_raw"
   cache_dir: "cache"
   pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
```
Once we have `ccmmf_dir`, can we change this to use that as a variable? That way it is only necessary to change the system-specific path once. (This also applies to `master_design_points`, etc.)
```diff
-pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
+pecan_outdir: "$ccmmf_dir/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
```
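One caveat: as far as I know the config package does not expand `$`-style variables on its own, so the interpolation would need a small post-processing step. A hedged sketch (the `cfg` values below are illustrative, not the repo's real config):

```r
# Sketch: expand a "$ccmmf_dir" placeholder after loading the config, so the
# system-specific base path only needs to change in one place.
cfg <- list(
  ccmmf_dir = "/projectnb2/dietzelab/ccmmf",
  pecan_outdir = "$ccmmf_dir/modelout/ccmmf_phase_2b_mixed_pfts_20250701",
  design_points = "$ccmmf_dir/data/design_points.csv"
)
resolve_paths <- function(cfg) {
  lapply(cfg, function(x) {
    if (is.character(x)) gsub("$ccmmf_dir", cfg$ccmmf_dir, x, fixed = TRUE) else x
  })
}
cfg <- resolve_paths(cfg)
```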
```yaml
raw_data_dir: "data_raw"
cache_dir: "cache"
pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
master_design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
```
nit: these are just design points ...
```diff
-master_design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
+design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
```
```r
cfg <- config::get(file = "000-config.yml")

# Define paths based on config
master_file <- cfg$paths$master_design_points
```
These may seem like trivial naming requests, but clarity and specificity are important.

- Why is this 'master_design_points'? That implies that there are other sets of design points.
- I would call this object `design_points_csv` instead of `master_file`.
- Instead of `master_data`, I would call that object `design_points`.
```r
  )
) |>
  dplyr::filter(!is.na(pft)) |>
  dplyr::slice_sample(n = n_sample) |>
```
why are we sampling here?
```r
# Prepares design points for Uncertainty Analysis.
#
# Currently, this script consumes the 'design_points.csv' from the shared
# directory. At this stage of the project, these are MANUALLY SELECTED
```
These were not manually selected; they came from the site selection workflow.
```r
#!/usr/bin/env Rscript

# =======================================================================
# 001_setup_design_points.R
```
The fact that this script exists makes me wonder whether the clustering workflow that generates design_points.csv should be updated - that will help ensure consistency in workflows that consume it.
```r
raw_data_dir <- cfg$paths$raw_data_dir

# SA configuration (configurable paths)
options <- list(
```
It is unclear why these configs are hard-coded here if they are already in the config file and/or template.xml.
```r
)
site_info <- site_info |> dplyr::rename(id = site_id)

# Read template settings
```
consider - what is the minimum amount of information that we want the end users to be able to configure, and why?
In addition to file paths, I can see n_sample, start_date, end_date being useful for being able to trigger 'development' mode.
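A sketch of what a minimal user-facing config could look like under that principle (keys and values are illustrative, not the repo's actual schema; the config package supports `inherits` for profile layering):

```yaml
# Hypothetical minimal 000-config.yml: paths plus a "development" profile
# that overrides only what is needed for a fast test run.
default:
  design_points_file: "data_raw/sa_design_points.csv"
  start_date: "2016-01-01"
  end_date: "2020-12-31"

development:
  inherits: default
  n_sample: 5             # small subset of sites for quick iteration
  end_date: "2016-12-31"  # one-year run
```

Loading with `config::get(config = "development")` would then merge these overrides onto the defaults.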
```r
  dir.create(settings$outdir, recursive = TRUE)
}

# Handle workflow resumption
```
is it worth the effort and overhead to manage a STATUS file? My recollection is that the STATUS file was developed primarily for the PHP web interface to display. It seems that dropping the status file would simplify the code a lot. What benefit does it provide, if any?
This is a context where targets might be useful.
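For comparison, here is a minimal base-R sketch of step-level resumption without a STATUS file (function and step names are illustrative); a {targets} pipeline generalizes the same idea with automatic dependency tracking:

```r
# Each step caches its result and is skipped when the cache already exists,
# which covers the resumption use case without manual state tracking.
run_step <- function(name, cache_dir, fun) {
  cache <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(cache)) {
    return(readRDS(cache))  # resume: reuse the stored result
  }
  result <- fun()
  saveRDS(result, cache)
  result
}

cache_dir <- tempfile("sa_cache_")
dir.create(cache_dir)
x <- run_step("design_points", cache_dir, function() data.frame(id = 1:3))
# A second invocation reads the cache; the function body is never called.
y <- run_step("design_points", cache_dir, function() stop("should not re-run"))
```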
dlebauer
left a comment
Overall the report is a nice analysis. A few changes could make the results easier to interpret.
- Read and implement "turning tables into graphs".
- Lead with a 3–5 bullet summary of the top findings (what parameters are most important to calibrate, and how these vary with environmental drivers).
- Display all of the elasticity plots at once using faceting.
- Show variance across sites in the elasticity plot.
- Show sites on a map similar to the one showing the design points in the downscaling repo.
- Replace the site table with a rank-distribution plot (median + IQR) and link to a downloadable CSV.
- Make gradient results a small multiple of partial dependence plots for the top 3 parameters, rather than a long table.
```markdown
Parameter rankings by mean absolute elasticity identify which inputs most strongly influence model outputs, averaged across all environmental conditions. High elasticity indicates the model is highly sensitive to that parameter.

::: panel-tabset
```
It would be helpful to be able to view all of these plots at the same time, either as panels or different color bars.
By default, I prefer to show medians, especially for non-normally distributed values. If they exist, consider showing the distribution of values (e.g., across sites)
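A sketch of that layout with toy data (not the report's real results): facets show every output at once, and boxplots give the median and IQR of elasticity across sites instead of a single bar per parameter.

```r
# Toy example: across-site elasticity distributions, all outputs faceted.
library(ggplot2)
set.seed(1)
toy <- expand.grid(
  parameter = paste0("p", 1:4),
  response_var = c("NPP", "SOC"),
  site = 1:20
)
toy$elasticity <- rnorm(nrow(toy))
p <- ggplot(toy, aes(x = parameter, y = elasticity)) +
  geom_boxplot() +             # median + IQR across sites
  facet_wrap(~ response_var) + # all outputs visible side by side
  coord_flip()
```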
```markdown
------------------------------------------------------------------------

# Site-by-Site Sensitivity {#sec-site-sensitivity}
```
The site-by-site table is difficult to interpret and doesn't add much once across-site variance is shown in the elasticity plot. Let's remove it. Andrew Gelman's "Let's practice what we preach: turning tables into graphs" makes a good case for visual summaries when tables get this dense: https://sites.stat.columbia.edu/gelman/research/published/dodhia.pdf
(The reason that I put a dense table of county aggregated means in the downscaling report is that those values are the numbers that the client wants)
A map would be a nice way to show the distribution of sites. It could eventually (not required for MVP) be interesting to map elasticity of the first, say, five parameters.
A few small usability notes for this style of table:
- When rendered as HTML, the table scrolls horizontally but headers don't stay aligned.
- Two significant figures are likely enough; additional precision is noise.
- The cell color bars are helpful for relative magnitude.
Consider something like a heat map. it isn't necessary to show every site, but perhaps summarize by environmental cluster?
See Dietze et al 2017 https://agupubs.onlinelibrary.wiley.com/doi/10.1002/2013JG002392
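A sketch of that heat map with toy data (cluster labels and parameter names are made up): summarizing elasticity by environmental cluster avoids a row per site, in the spirit of Dietze et al. 2017.

```r
# Toy cluster-level heat map of elasticity.
library(ggplot2)
set.seed(1)
toy <- expand.grid(parameter = paste0("p", 1:5), cluster = paste0("c", 1:4))
toy$elasticity <- rnorm(nrow(toy))
p <- ggplot(toy, aes(x = cluster, y = parameter, fill = elasticity)) +
  geom_tile() +
  scale_fill_gradient2()  # diverging scale keeps the sign of elasticity visible
```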
```markdown
# Technical Details {#sec-technical}

## Sensitivity Metrics
```
Link to this section from above, or perhaps move it to the beginning.
```markdown
- **Negative slope**: Sensitivity *decreases* along gradient
- **High** $R^2$: Strong environmental control on parameter importance

## Variance Explained Trends
```
It's not clear why there are only two unique values of $R^2$ in the table.
```markdown
:::

## Summary Statistics

Regression analysis tests whether parameter sensitivity systematically varies with environmental conditions, identifying context-dependent calibration needs.

## Regression Results
```
One way these could be displayed is using facets:

- each row a different parameter (for top params / greatest slopes)
- each column a different environmental covariate
- each plot has x = env covariate and y = elasticity. Could use color to plot multiple outputs (SOC, NPP, etc.) on the same plot, or create separate plots for each output.
Another note on plots: by default, keep the axes (and other scales) the same across different variations of a plot (e.g. all elasticity plots should share the same range). This makes it easier to compare. Sometimes it is necessary to break this rule, e.g. to show interesting trends that would otherwise be obscured.
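A sketch of that faceted layout with toy data (the covariate names MAT and MAP are placeholders): rows are parameters, columns are environmental covariates, and facet_grid's default shared scales keep panels comparable.

```r
# Toy example: elasticity vs. environmental covariate, faceted by
# parameter (rows) and covariate (columns), with a per-facet regression line.
library(ggplot2)
set.seed(1)
toy <- expand.grid(
  parameter = paste0("p", 1:3),
  covariate = c("MAT", "MAP"),
  site = 1:30
)
toy$env_value <- runif(nrow(toy))
toy$elasticity <- rnorm(nrow(toy))
p <- ggplot(toy, aes(x = env_value, y = elasticity)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  facet_grid(parameter ~ covariate)
```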
```r
parameter_rankings <- aggregated_results |>
  dplyr::group_by(parameter, response_var) |>
  dplyr::summarize(
    # Mean absolute elasticity (primary metric for sensitivity)
```
Taking the absolute value of elasticity makes sense when ranking parameter importance, but for plotting and regression, we should use untransformed elasticity.
There are lots of ways to plot this information, but this example is from LeBauer et al 2013 (shown here mostly to note that the direction of elasticity is also important information).
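A sketch of keeping both metrics (toy data; the column names follow the snippet above): rank by mean |elasticity|, but carry the signed mean forward for plotting and regression.

```r
# Toy example: compute both the absolute metric (for ranking) and the
# signed metric (for plots and regression) in one summarize call.
library(dplyr)
set.seed(1)
aggregated_results <- data.frame(
  parameter = rep(c("a", "b"), each = 10),
  response_var = "NPP",
  elasticity = rnorm(20)
)
parameter_rankings <- aggregated_results |>
  group_by(parameter, response_var) |>
  summarize(
    mean_abs_elasticity = mean(abs(elasticity)),  # for ranking importance
    mean_elasticity = mean(elasticity),           # signed, for plots/regression
    .groups = "drop"
  )
```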
Additional items for the "TODO" list.

(figure from LeBauer et al. 2013 omitted)
Summary:
Implements a complete multi-site local sensitivity analysis (SA) workflow for SIPNET as specified in #151. This PR delivers aggregated sensitivity results across design points, environmental gradient analysis, and a Quarto-based report summarizing parameter importance patterns.
Addresses Issue #151 Requirements